SwiftKey develops a word prediction application that is used while typing on a mobile keyboard. When the user types “I went to the”, the application presents three options for what the next word might be. For example, the three words might be gym, store, and restaurant.
In this project, we use R to build a predictive model using text data provided by the Data Science Capstone course. The data consists of text from ‘Blogs’, ‘News’ and ‘Twitter’ totaling more than 4 million lines and ??? unique words.
Here’s a summary of the data analysis performed in this report.
First, we sample 1% of the lines in each file to speed up the data exploration. The implementation is in sample_capstone_data in sample_data.R. We use the tm R package to load each sample file for analysis.
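As a rough illustration of the sampling step, the sketch below draws a fixed 1% of the lines of a file in base R. The function name `sample_lines`, the seed, and the temporary demo file are assumptions made for this example; it is not the report's sample_capstone_data.

```r
# Hypothetical helper (not the report's sample_capstone_data):
# return a random subset containing `rate` of the lines in a text file.
sample_lines <- function(path, rate = 0.01, seed = 42) {
  set.seed(seed)  # fixes the global RNG state so the sample is reproducible
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  size <- max(1, round(rate * length(lines)))
  idx <- sort(sample.int(length(lines), size))  # keep the original line order
  lines[idx]
}

# Demonstration on a temporary file standing in for one of the source files.
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:10000), tmp)
sampled <- sample_lines(tmp, rate = 0.01)
length(sampled)
```

Sampling by index rather than by line content keeps duplicate lines distinct and preserves the order in which lines appear in the file.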
| | source | num_lines | num_unique_words | mean_word_freq | median_word_freq |
|---|---|---|---|---|---|
| 1 | twitter | 23602 | 8040 | 20 | 9 |
| 2 | blogs | 8993 | 12414 | 15 | 6 |
| 3 | news | 10103 | 12850 | 15 | 7 |
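To make the columns of this table concrete, here is a minimal base R sketch that derives the same kind of statistics from a character vector of lines. The helper `summarise_words` and its simple tokenizer are assumptions; the report's own figures come from its tm-based pipeline and may be computed differently.

```r
# Hypothetical helper: summarise word usage in a character vector of lines.
summarise_words <- function(lines, source) {
  # Lowercase, then split on anything that is not a letter or apostrophe.
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[nzchar(words)]          # drop empty tokens
  freq <- as.vector(table(words))        # occurrences of each unique word
  data.frame(
    source = source,
    num_lines = length(lines),
    num_unique_words = length(freq),
    mean_word_freq = mean(freq),
    median_word_freq = median(freq),
    stringsAsFactors = FALSE
  )
}

toy <- c("the cat sat on the mat", "the dog sat")
summarise_words(toy, "toy")
```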
We perform the following text processing steps prior to parsing ngrams.
For example, look at the word frequency distribution for the sample data:
p <- all_docs_word_plot(sample_vector_corpus)
print(p)
Let’s load all the data sources into a single corpus.
docs <- load_sample_dircorpus()
docs <- preprocess_entries(docs)
Here are the top bigrams.
ngram_2 <- get_docterm_matrix(docs, 2)
p2 <- generate_word_frequency_plot(ngram_2$wf, "Top Bigrams for Sampled Text")
print(p2)
Here are the top trigrams.
ngram_3 <- get_docterm_matrix(docs, 3)
p3 <- generate_word_frequency_plot(ngram_3$wf, "Top Trigrams for Sampled Text")
print(p3)
Here are the top 4-grams.
ngram_4 <- get_docterm_matrix(docs, 4)
p4 <- generate_word_frequency_plot(ngram_4$wf, "Top 4-grams for Sampled Text")
print(p4)
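For readers without the report's helper functions, the following base R sketch shows one way n-gram frequencies like those plotted above could be counted. The helper `count_ngrams` and its tokenizer are assumptions for illustration and are not the report's get_docterm_matrix.

```r
# Hypothetical helper: count word n-grams in a character vector of lines.
count_ngrams <- function(lines, n) {
  per_line <- lapply(strsplit(tolower(lines), "[^a-z']+"), function(tokens) {
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))  # line too short for an n-gram
    starts <- seq_len(length(tokens) - n + 1)
    # Join each window of n consecutive tokens into one space-separated string.
    vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  sort(table(unlist(per_line)), decreasing = TRUE)
}

toy <- c("data entry is fun", "data entry is slow", "data streams are fast")
bigrams <- count_ngrams(toy, 2)
head(bigrams, 3)
```

N-grams are built per line, so no n-gram spans a line boundary.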
We build a tree from the n-grams and compute the maximum likelihood estimate (MLE) using the Dirichlet-multinomial model. We use node.tree, which can build a tree from a data.frame. Now let’s perform a search for “data”.
Here are the maximum likelihood estimates. They show a 6% likelihood that “entry” will be the next word: “data entry” has a frequency of 12 and “data” has a frequency of 198, so the maximum likelihood estimate is 12/198 ≈ 6.1%.
results <- perform_search(ngram_tree, c("data"))
print(results)
## 12 10
## recommended_words "entry" "streams"
## likelihood "0.0606060606060606" "0.0505050505050505"
## 8 7
## recommended_words "recovery" "dating"
## likelihood "0.0404040404040404" "0.0353535353535354"
## 7
## recommended_words "personalize"
## likelihood "0.0353535353535354"
If we then query for “data entry”, we walk the tree through the node “data” and then the node “entry”, and we recommend the words “just” and “respond”.
results <- perform_search(ngram_tree, c("data", "entry"))
print(results)
## 6 6
## recommended_words "just" "respond"
## likelihood "0.5" "0.5"
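The tree lookup and likelihood calculation described above can be sketched in base R as follows. This is a minimal illustration that uses nested environments instead of the tree package named in the report. The functions `new_node`, `insert_ngram`, and `predict_next` are hypothetical and are not the report's perform_search. The counts in the demo are the ones quoted in the text and output above.

```r
# Hypothetical sketch: each node is an environment with a count and children.
new_node <- function() {
  node <- new.env()
  node$count <- 0
  node$children <- list()
  node
}

# Store the frequency of one n-gram (a character vector of words) at the
# node reached by following its words from the root.
insert_ngram <- function(root, words, freq) {
  node <- root
  for (w in words) {
    if (is.null(node$children[[w]])) node$children[[w]] <- new_node()
    node <- node$children[[w]]
  }
  node$count <- freq
  invisible(root)
}

# Follow the context words down the tree, then score each child as
# count(context + word) / count(context).
predict_next <- function(root, words, top = 3) {
  node <- root
  for (w in words) {
    node <- node$children[[w]]
    if (is.null(node)) return(numeric(0))  # context never seen
  }
  counts <- vapply(node$children, function(child) child$count, numeric(1))
  if (length(counts) == 0) return(numeric(0))
  # Fall back to the children's total if the context has no count of its own.
  total <- if (node$count > 0) node$count else sum(counts)
  head(sort(counts / total, decreasing = TRUE), top)
}

root <- new_node()
insert_ngram(root, c("data"), 198)
insert_ngram(root, c("data", "entry"), 12)
insert_ngram(root, c("data", "streams"), 10)
insert_ngram(root, c("data", "entry", "just"), 6)
insert_ngram(root, c("data", "entry", "respond"), 6)

predict_next(root, c("data"))           # entry = 12/198, streams = 10/198
predict_next(root, c("data", "entry"))  # just = 6/12, respond = 6/12
```

Environments have reference semantics in R, so nodes can be updated in place while walking the tree, which keeps the insert loop short.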